Multilingual Topic Detection Using a Parallel Corpus
Abstract
We have developed an approach to topic detection in multilingual news, in particular Chinese and English news. Named entities such as personal names, geographical location names, and organization names are extracted automatically from the news content by transformation-based linguistic taggers. These sets of named entities, together with the remaining content terms, form the basis of the news representation. A gross translation of the Chinese story representation into English is produced using easily available resources. We have investigated two approaches to gross translation: a basic method that uses only a bilingual dictionary, and a second method that uses a parallel corpus as an additional resource. The topic discovery task uses a modified agglomerative clustering algorithm to group stories. One difference between our clustering approach and the standard agglomerative one is that we maintain three kinds of elements in the clustering process: stories, temporary clusters, and final clusters.
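The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary `TOY_DICT`, the function names, the cosine similarity measure, and the merge threshold are all hypothetical choices made for illustration. The clustering shown is the plain agglomerative procedure, not the modified variant with temporary and final clusters, whose details the abstract does not give.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy bilingual dictionary (Chinese term -> English candidates).
TOY_DICT = {
    "地震": ["earthquake"],
    "北京": ["Beijing"],
    "选举": ["election", "vote"],
}


def gross_translate(terms, dictionary):
    """Replace each term by all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged.
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(stories, threshold):
    """Plain agglomerative clustering over lists of terms.

    Repeatedly merges the most similar pair of clusters and stops
    when the best similarity falls below the threshold.
    Returns a list of clusters, each a sorted list of story indices.
    """
    clusters = [([i], Counter(terms)) for i, terms in enumerate(stories)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


# Two Chinese stories (as term lists) are gross-translated, then
# clustered together with two English stories.
stories = [
    gross_translate(["地震", "北京"], TOY_DICT),
    ["earthquake", "Beijing", "rescue"],
    ["election", "vote", "candidate"],
    gross_translate(["选举"], TOY_DICT),
]
topic_clusters = agglomerate(stories, threshold=0.5)
```

In this toy setup each translated Chinese story ends up in the same cluster as the English story on the same topic. A real system would also need the named-entity extraction and the parallel-corpus resource mentioned in the abstract, which this sketch leaves out.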
Similar Resources
Using Multilingual Topic Models for Improved Alignment in English-Hindi MT
Parallel corpora are often injected with bilingual dictionaries for improved Indian language machine translation (MT). In the absence of such dictionaries, a coarse dictionary may be required. This paper demonstrates the use of a multilingual topic model for creating coarse dictionaries for English-Hindi MT. We compare our approaches with: (a) a baseline with no additional dictionary injection, and...
Multilingual Relevant Sentence Detection Using Reference Corpus
IR with a reference corpus is one approach to relevant sentence detection, which takes the result of IR as the representation of the query (sentence). Lack of information and language difference are two major issues in relevant sentence detection among multilingual sentences. This paper refers to a parallel corpus for information expansion and translation, and introduces different representati...
That'll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models
Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In the absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrime...
Discovering Parallel Text from the World Wide Web
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...
Automatic Dictionary Construction and Identification of Parallel Text Pairs
When creating dictionaries for use in, for example, cross-language search engines, parallel or comparable text pairs are needed. Multilingual web sites may contain parallel texts, but these can be difficult to detect. For instance, a multilingual website, Hallå Norden, contains information in five languages: Swedish, Danish, Norwegian, Icelandic, and Finnish. Working with these texts we discovered ...